Abstract

We give the first $L_1$-sketching algorithm for integer vectors which produces nearly optimal
sized sketches in nearly linear time. This answers the first open problem in the list of open
problems from the 2006 IITK Workshop on Algorithms for Data Streams. Specifically, suppose
Alice receives a vector $x \in \{-M, \ldots, M\}^n$ and Bob receives $y \in \{-M, \ldots, M\}^n$, and the two
parties share randomness. Each party must output a short sketch of their vector such that a third
party can later quickly recover a $(1 \pm \varepsilon)$-approximation to $\|x - y\|_1$ with 2/3 probability given
only the sketches. We give a sketching algorithm which produces $O(\varepsilon^{-2} \log(1/\varepsilon) \log(nM))$-bit
sketches in $O(n \log^2(nM))$ time, independent of $\varepsilon$. The previous best known sketching algorithm
for $L_1$ is due to [Feigenbaum et al., SICOMP 2002], which achieved the optimal sketch length
of $O(\varepsilon^{-2} \log(nM))$ bits but had a running time of $O(n \log(nM)/\varepsilon^2)$. Notice that our running
time is near-linear for every $\varepsilon$, whereas for sufficiently small values of $\varepsilon$, the running time of the
previous algorithm can be as large as quadratic. Like their algorithm, our sketching procedure
also yields a small-space, one-pass streaming algorithm which works even if the entries of $x, y$
are given in arbitrary order.

1 Introduction 

Space- and time-efficient processing of massive databases is a challenging and important task in
applications such as observational sciences, product marketing, and monitoring large systems. Usually
the data set is distributed across several network devices, each receiving a portion of the data as a
stream. The devices must locally process their data, producing a small sketch, which can then be
efficiently transmitted to other devices for further processing. Although much work has focused on
producing sketches of minimal size for various problems, in practice the time needed to produce
the sketches is just as important as the sketch size, if not more so.

In the 2006 IITK Workshop on Algorithms for Data Streams, the first open question posed [15]
was to find a space- and time-efficient algorithm for $L_1$-difference computation. Formally, there
are two parties, Alice and Bob, who have vectors $x, y \in \{-M, \ldots, M\}^n$ and wish to compute
sketches $s(x)$ and $s(y)$, respectively, so that a third party can quickly recover a value $Z$ with
$(1 - \varepsilon)\|x - y\|_1 \le Z \le (1 + \varepsilon)\|x - y\|_1$. Here, $\|x - y\|_1 = \sum_{i=1}^n |x_i - y_i|$ denotes the $L_1$-norm of the
vector $x - y$. The third party should succeed with probability at least 2/3 over the randomness of
Alice and Bob (this probability can be amplified by repeating the process and taking the median).

The original motivation [11] for the $L_1$-difference problem is Internet-traffic monitoring. As
packets travel through Cisco routers, the NetFlow software [23] produces summary statistics of
groups of packets with the same source and destination IP address. Such a group of packets is
known as a flow. At the end of a specified time period, a router assembles sets of values $(s, f_t(s))$,
where $s$ is a source-destination pair, and $f_t(s)$ is the total number of bytes sent from the source to
the destination in time period $t$. The $L_1$-difference between such sets assembled during different
time periods or at different routers indicates differences in traffic patterns.

The ability to produce a short sketch summarizing the set of values allows a central control 
and storage facility to later efficiently approximate the $L_1$-difference between the sketches that it
receives. The routers producing the sketches cannot predict which source-destination pairs they 
will receive, or in which order. Since the routers can transmit their sketches and updates to their 
sketches to the central processing facility in an arbitrarily interleaved manner, it is essential that the 
$L_1$-difference algorithm support arbitrary permutations of the assembled sets of values. Because of
the huge size of the packet streams, it is also crucial that the computation time required to produce 
the sketches be as small as possible. 

The first algorithm for this problem is due to Feigenbaum et al. [11], and achieves a sketch
of size $O(\varepsilon^{-2} \log(nM))$ bits with each party requiring $O(n \log(nM)/\varepsilon^2)$ processing time (in word
operations). Later, Indyk [16] generalized this to the problem of estimating the $L_1$-norm of a
general data stream with an arbitrary number of updates to each coordinate. For the $L_1$-difference
problem, the space of Indyk's method is worse than that of [11] by a factor of $\log(n)$, while the
time complexity is similar. Recently, in [21] it was shown how to reduce the space complexity
of Indyk's method, thereby matching the sketch size of [11]. However, the time complexity per
stream update is $\Omega(\varepsilon^{-2})$. Also in [21], a space lower bound of $\Omega(\varepsilon^{-2} \log(nM))$ was shown for the
$L_1$-difference problem for nearly the full range of interesting values for $\varepsilon$, thereby showing that the
space complexity of the algorithms of [11], [21] is optimal.

While the space complexity for $L_1$-difference is settled, there are no non-trivial lower bounds
for the time complexity. Notice that the $\varepsilon^{-2}$ factor in the processing time can be a severe drawback
in practice, and can make the difference between setting the approximation quality to $\varepsilon = .1$ or
to $\varepsilon = .01$. Indeed, in several previous works (see the references in Muthukrishnan's book [22],
or in Indyk's course notes [17]), the main goal was to reduce the dependence on $\varepsilon$ in the space
and/or time complexity. This raises the question, posed in the IITK workshop, as to whether
this dependence on $\varepsilon$ can be improved for $L_1$-difference. As a first step, Cormode and Ganguly
[14] show that if one increases the sketch size to $\varepsilon^{-3}\,\mathrm{polylog}(nM)$, then it is possible to achieve
processing time $n \cdot \mathrm{polylog}(nM)$, thereby removing the dependence on $\varepsilon$. Their algorithm even
works for $L_1$-norm estimation of general data streams, and not just for $L_1$-difference. However, for
reasonable values of $nM$, the sketch size is dominated by the $\varepsilon^{-3}$ term, which may be prohibitive
in practice, and is sub-optimal.

In this paper we show how to achieve a near-optimal $O(\varepsilon^{-2} \log(1/\varepsilon) \log(nM))$ sketch size, while
simultaneously achieving a near-linear $O(n \log^2(nM))$ processing time, independent of $\varepsilon$. Notice
our space is only a factor of $\log(1/\varepsilon)$ more than the lower bound. The time for a third party to
recover a $(1 \pm \varepsilon)$-approximation to the $L_1$-difference, given the sketches, is nearly linear in the sketch
size. Furthermore, our sketching procedure naturally can be implemented as a one-pass streaming
algorithm over an adversarial ordering of the coordinates of $x, y$ (this was also true of previous
algorithms). Thus, up to small factors, we resolve the first open question of [15]. We henceforth
describe our sketching procedure as a streaming algorithm.

While in [H] and [15] it is suggested to use the techniques of [5] and ^\ for estimating $L_p$
norms, $p > 2$, which are themselves based on estimating coordinates of heavy weight individually
and removing them, we do not follow this approach. Moreover, the approach of [11] is to embed
$L_1$-difference into $L_2^2$, then use the AMS sketch [1] and range-summable hash functions they design
to reduce the processing time. We do not follow this approach either.

Instead, our first idea is to embed the $L_1$-difference problem into $L_0$, the number of non-zero
coordinates of the underlying vector (in this case $x - y$) presented as a data stream. Such an
embedding has been used before, for example, in lower bounding the space complexity of estimating
$L_0$ in a data stream [18]. Suppose for simplicity $x_i, y_i \ge 0$ for all $i \in [n]$. Here the idea is for Alice to
treat her input $x_i$ as a set of distinct items $M(i-1)+1, \ldots, M(i-1)+x_i$, while Bob treats his input
$y_i$ as a set of distinct items $M(i-1)+1, \ldots, M(i-1)+y_i$. Then the size of the set-difference of these
two sets is $|x_i - y_i|$. Thus, if Alice inserts all of the set elements corresponding to her coordinates
as insertions into an $L_0$-algorithm, while Bob inserts all of his elements as deletions, the $L_0$-value
in the resulting stream equals $\|x - y\|_1$. A recent space-efficient algorithm for estimating $L_0$ with
deletions is given in [21].
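The reduction can be checked directly on a small instance. The following is a minimal sketch, assuming non-negative inputs as in the paragraph above; the helper names `items` and `l0_after_reduction` are hypothetical, and the explicit dictionary of frequencies stands in for an $L_0$-algorithm (it is not space-efficient and takes time proportional to the number of inserted items).

```python
def items(i, value, M):
    """Universe elements M*(i-1)+1, ..., M*(i-1)+value for coordinate i (1-indexed)."""
    base = M * (i - 1)
    return range(base + 1, base + value + 1)


def l0_after_reduction(x, y, M):
    """Insert Alice's items (+1), delete Bob's items (-1), count non-zero entries."""
    freq = {}
    for i, (xi, yi) in enumerate(zip(x, y), start=1):
        for e in items(i, xi, M):
            freq[e] = freq.get(e, 0) + 1
        for e in items(i, yi, M):
            freq[e] = freq.get(e, 0) - 1
    return sum(1 for c in freq.values() if c != 0)


x = [3, 0, 5, 2]
y = [1, 4, 5, 0]
M = 5
l1_distance = sum(abs(a - b) for a, b in zip(x, y))
l0_value = l0_after_reduction(x, y, M)
```

On this instance both `l1_distance` and `l0_value` evaluate to 8, since coordinate $i$ contributes exactly $|x_i - y_i|$ surviving items.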

The problem with directly reducing to $L_0$ is that, while the resulting space complexity is small,
the processing time can be as large as $O(nM)$ since we must insert each set element into the
$L_0$-algorithm. We overcome this by developing a range-efficient $L_0$ algorithm, i.e. an algorithm
which allows updates to ranges at a time, which works for streams coming out of our reduction
by exploiting the structure of the ranges we update (all updated ranges are of length at most $M$ and
start at an index of the form $M(i-1)+1$). We note that range-efficient $L_0$ algorithms have been
developed before [2], [25], but those algorithms do not allow deletions and thus do not suffice for our
purposes.

At a high level, our algorithm works by sub-sampling by powers of 2 the universe $[nM]$ arising
out of our reduction to $L_0$. At each level we keep a data structure of size $O(\varepsilon^{-2} \log(1/\varepsilon))$ to
summarize the items that are sub-sampled at that level. We also maintain a data structure on the
side to handle the case when $L_0$ is small, and we in parallel obtain a constant-factor approximation
$R$ of the $L_1$-difference using [11]. At the stream's end, we give our estimate of the $L_1$-difference
based on the summary data structure living at the level where the expected number of universe
elements sub-sampled is $O(1/\varepsilon^2)$ (we can determine this level knowing $R$). As is the case in
many previous streaming algorithms, the sub-sampling of the stream can be implemented using
pairwise-independent hash functions. This allows us to use a subroutine developed by Pavan and
Tirthapura [25] for quickly counting the number of universe elements that are sub-sampled at each
of the $\log(nM)$ levels. Given these counts, our summary data structures are such that we can
update each one efficiently.

Our summary data structure at a given level maintains $(x' - y')H$, where $H$ is the parity-check
matrix of a linear error-correcting code, and $x', y'$ are the vectors derived from $x, y$ by sub-sampling
at that level. When promised that $x', y'$ differ on few coordinates, we can treat $x' - y'$ as a
corruption of the encoding of the all-zeros codeword, then attempt to decode to recover the "error" $x' - y'$.
The decoding succeeds as long as the minimum distance of the code is sufficiently high. This idea of
using error-correcting codes to sketch vectors whose distance is promised to be small is known in the
error-correcting codes literature as syndrome decoding [28]. Aside from error-correction, syndrome
decoding has also found uses in cryptography: in the work of [U [26] to give a two-party protocol 
for agreeing on a shared secret key when communicating over a noisy channel, and in the work of 
[10] as part of a private two-party communication protocol for computing the Hamming distance
between two bitstrings. 

For efficiency reasons, our implementation of the summary data structure is mostly inspired
by work of Feigenbaum et al. [10]. Given that $x', y'$ differ on at most $k$ coordinates, they use the
parity-check matrix of a Reed-Solomon code of minimum distance $O(k)$. Decoding can then be
done in time $O(k^2 + k \cdot \mathrm{polylog}(k) \log(n))$ using an algorithm of Dodis et al. [9]. In our application
$k = \Theta(\varepsilon^{-2})$, and thus this recovery procedure is too slow for our purposes. To remedy this, we
first hash the indices of $x', y'$ into $O(\varepsilon^{-2}/\log(1/\varepsilon))$ buckets with an $O(\log(1/\varepsilon))$-wise independent
hash function, then in each bucket we keep the product of the difference vector, restricted to the
indices mapped to that bucket, with the parity-check matrix. With constant probability, no bucket
receives more than $O(\log(1/\varepsilon))$ indices where $x', y'$ differ. We can thus use a Reed-Solomon code
with minimum distance only $O(\log(1/\varepsilon))$, making the algorithm of [9] fast enough for our purposes.

We note that our summary data structure in each level is in fact a $k$-set structure, as defined by
Ganguly [13], that can be used to return a set of $k$ items undergoing insertions and deletions in a data
stream. While Ganguly's $k$-set structure uses near-optimal space and has fast update time, it only
works in the strict turnstile model (i.e., it requires that each coordinate of $z = x - y$ is non-negative,
in which case the $L_1$-difference problem has a trivial solution: maintain an $O(\log(nM))$-bit counter).
This is due to the algorithm's reliance on the identity $(\sum_{i=1}^n i \cdot z_i)^2 = (\sum_{i=1}^n z_i)(\sum_{i=1}^n i^2 \cdot z_i)$, for
which there is no analogue outside the strict turnstile setting. Using certain modifications inspired
by the Count-Min sketch [8] it may be possible to implement his algorithm in the turnstile model,
though the resulting space and time would be sub-optimal. In a different work, Ganguly and
Majumder [12] design a deterministic $k$-set structure based on Vandermonde matrices, but the
space required by this structure is sub-optimal.

Our fast sketching algorithm for $L_1$-difference improves the running time of an algorithm of
Jayram and the second author [20] by a factor for estimating $L_1(L_2)$ of a matrix $A$, defined
as the sum of Euclidean lengths of the rows of $A$. As $L_1$-difference is a basic primitive, we believe
our algorithm is likely to have many further applications.

2 Preliminaries 

All space bounds mentioned throughout this paper are in bits, and all logarithms are base 2, unless
explicitly stated otherwise. Running times are measured as the number of standard machine word
operations (integer arithmetic, bitwise operations, and bitshifts). Each machine word is assumed
to be $\Omega(\log(nM))$ bits so that we can index each vector and do arithmetic on vector entries in
constant time. Also, for integer $A$, $[A]$ denotes the set $\{1, \ldots, A\}$.

We now formally define the model in which our sketching procedure runs. Alice receives $x \in
\{-M, \ldots, M\}^n$, and Bob receives $y \in \{-M, \ldots, M\}^n$. Both parties have access to a shared source
of randomness and must, respectively, output bit-strings $s(x)$ and $s(y)$. The requirement is that a
third party can, given access to only $s(x)$ and $s(y)$, compute a value $Z$ such that
$\Pr[|Z - \|x - y\|_1| > \varepsilon\|x - y\|_1] \le 1/3$ (recall $\|x - y\|_1 \stackrel{\mathrm{def}}{=} \sum_{i=1}^n |x_i - y_i|$). The probability is over the randomness shared
by Alice and Bob, and the value $\varepsilon \in (0, 1]$ is a parameter given to all parties. The goal is to
minimize the lengths of $s(x)$ and $s(y)$, as well as the amount of time Alice and Bob each take to
compute them. Without loss of generality, throughout this document we assume $x_i, y_i \ge 0$ for all
$i$. This promise can be enforced by increasing all coordinates of $x, y$ by $M$, which does not alter
$\|x - y\|_1$. Doing so increases the upper bound on coordinate entries by a factor of two, but this
alters our algorithm's running time and resulting sketch size by subconstant factors.

Since we present our sketching algorithm as a streaming algorithm in Section 3, we now introduce
some streaming notation. We consider a vector $f = (f_1, f_2, \ldots, f_n)$ that is updated in a stream
as follows. The stream has exactly $2n$ updates $(i_1, v_1), \ldots, (i_{2n}, v_{2n}) \in [n] \times \{-M, \ldots, M\}$. Each
update $(i, v)$ corresponds to the action $f_i \leftarrow f_i + v$. For each $j \in [n]$, there are exactly two stream
updates $(i, v)$ with $i = j$. If these two stream updates are $(i_{z_1}, v_{z_1}), (i_{z_2}, v_{z_2})$, then at most one of
$v_{z_1}, v_{z_2}$ is negative, and at most one is positive. The nonnegative update corresponds to adding $x_i$ to
$f_i$, and the nonpositive update corresponds to subtracting $y_i$ from $f_i$ (recall we assumed $x_i, y_i \ge 0$).
We make no restriction on the possible values for $z_1$ and $z_2$. That is, our algorithm functions
correctly even if the stream presents us with an adversarial permutation of the $2n$ coordinates
$x_1, \ldots, x_n, y_1, \ldots, y_n$. At the end of the stream $\|f\|_1 = \|x - y\|_1$, so our streaming algorithm must
approximate $\|f\|_1$. For Alice and Bob to use our streaming algorithm for sketching, Alice runs the
algorithm with updates $(i, x_i)$ for each $1 \le i \le n$, and Bob separately runs the algorithm (using
the same random bits) with updates $(i, y_i)$. The sketches they produce are simply the contents of
the algorithm's memory at the end of the stream. It is a consequence of how our algorithm works
that these sketches can be efficiently combined by a third party to approximate $\|f\|_1$.
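To make the stream model and the role of linearity concrete, here is a minimal sketch. The functions `apply_stream` and `linear_sketch` are hypothetical names, and the random $\pm 1$ rows are only a stand-in for a linear sketch built from shared randomness; they are not the sketch constructed in this paper.

```python
import random


def apply_stream(n, updates):
    """Apply updates (i, v), meaning f_i <- f_i + v, and return f for coordinates 1..n."""
    f = [0] * n
    for i, v in updates:
        f[i - 1] += v
    return f


def linear_sketch(vec, rows):
    """Toy linear sketch: one inner product per row (illustrative stand-in only)."""
    return [sum(r * v for r, v in zip(row, vec)) for row in rows]


x = [3, 0, 5, 2]
y = [1, 4, 5, 0]
n = len(x)

# The 2n updates, presented in an arbitrary (here: shuffled) order.
updates = [(i, xi) for i, xi in enumerate(x, 1)]
updates += [(i, -yi) for i, yi in enumerate(y, 1)]
random.Random(0).shuffle(updates)
f = apply_stream(n, updates)
l1_of_f = sum(abs(v) for v in f)

# Shared randomness: both parties derive the same rows from the same seed.
rng = random.Random(1)
rows = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(3)]
alice = linear_sketch(x, rows)
bob = linear_sketch(y, rows)
combined = [a - b for a, b in zip(alice, bob)]
```

Whatever the order of the updates, `f` ends as $x - y$, and because the toy sketch is linear, `combined` equals the sketch of `f`. The same subtraction argument is what lets a third party combine the two parties' memory contents.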



3 Main Streaming Algorithm 

Throughout this section we assume $\varepsilon \ge 1/\sqrt{n}$. Otherwise, we can compute $\|f\|_1$ exactly by keeping
the entire vector in memory using $O(n \log M) = O(\varepsilon^{-2} \log M)$ space with constant update time.



3.1 Handling Small $L_1$

We give a subroutine TwoLevelEstimator, described in Figure 1, to compute $L_1$ exactly when
promised that $L_1 \le k$ for some parameter $k \ge 2$. We assume that integers polynomially large in $k$
fit in a machine word, which will be true in our use of this subroutine later.
In Figure 1 we assume we have already calculated a prime $p$ satisfying

\[ C \le p \le 2C, \qquad C = 4 \cdot (5\lceil \log k \rceil + 24)^2 \cdot \lceil k/\log k \rceil + 1 \tag{1} \]

(the choice of $p$ will be justified later), along with a generator $g$ for the multiplicative group $\mathbb{F}_p^*$.
We also precalculate logarithm tables $T_1, T_2$ such that $T_1[i] = g^i \bmod p$ and $T_2[x] = \mathrm{dlog}(x)$, where
$0 \le i \le p - 2$ and $1 \le x \le p - 1$. Here $\mathrm{dlog}(x)$ is the discrete logarithm of $x$ (i.e. the $i \in GF(p)$
such that $g^i \equiv x \bmod p$).
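As an illustration of how the two tables support constant-time exponentiation, the following is a minimal sketch for a small prime. The function names are hypothetical, and the brute-force generator search is used only for brevity; it is not the method referred to in the text for finding $g$.

```python
def find_generator(p):
    """Brute-force search for a generator of the multiplicative group mod a prime p >= 3."""
    for g in range(2, p):
        seen = set()
        y = 1
        for _ in range(p - 1):
            y = (y * g) % p
            seen.add(y)
        if len(seen) == p - 1:
            return g
    raise ValueError("no generator found; is p an odd prime?")


def build_tables(p, g):
    """T1[i] = g^i mod p for 0 <= i <= p-2, and T2[x] = dlog(x) for 1 <= x <= p-1."""
    T1 = [pow(g, i, p) for i in range(p - 1)]
    T2 = [None] * p  # index 0 is unused
    for i, value in enumerate(T1):
        T2[value] = i
    return T1, T2


def power_by_lookup(x, z, p, T1, T2):
    """Compute x^z mod p as g^(z * dlog(x)) using two lookups and one multiplication."""
    return T1[(z * T2[x]) % (p - 1)]


p = 101
g = find_generator(p)
T1, T2 = build_tables(p, g)
```

With these tables, `power_by_lookup(x, z, p, T1, T2)` agrees with `pow(x, z, p)` for every non-zero residue `x`, which is the property the counter updates rely on.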

The subroutine TwoLevelEstimator makes calls to the following algorithm given in [9]. 

Theorem 1 (Dodis et al. [9, Lemma E.1]). Let $p$ be prime and $r = (r_x)_{x \in \mathbb{F}_p^*}$ have at most $s$
non-zero entries ($2s + 1 < p$). Given $\sum_{x \in \mathbb{F}_p^*} r_x x^i$ for $i \in [2s]$, there is an algorithm to recover
$\{(x, r_x) \mid r_x \ne 0\}$ which uses $O(s^2 + s(\log s)(\log\log s)(\log p))$ field operations over $GF(p)$. ■

The proof of correctness of TwoLevelEstimator relies in part on the following lemma.
Lemma 2 (Bellare and Rompel [3, Lemma 2.3]). Let $X_i \in [0, 1]$, $1 \le i \le n$, be $t$-wise independent
for $t \ge 4$ an even integer, $X = \sum_{i=1}^n X_i$, and $A > 0$. Then $\Pr[|X - E[X]| \ge A] \le 8\left(\frac{tE[X] + t^2}{A^2}\right)^{t/2}$.



Theorem 3. Ignoring the space to store the hash functions $h_1, h_2$ and tables $T_1, T_2$, the algorithm
TwoLevelEstimator uses $O(k \log k)$ bits of space. The hash functions $h_1, h_2$ and tables $T_1, T_2$
require an additional $O((\log k)(\log n) + k \log^2 k)$ bits. The time to process a stream update is $O(\log k)$.
If $L_1 \le k$, the final output value of TwoLevelEstimator equals $L_1$ exactly with probability at
least 3/4.

Subroutine TwoLevelEstimator:

// Compute $\|f\|_1$ exactly when promised $\|f\|_1 \le k$. The value $p$ is as in Eq. (1).

1. Set $t = 2\lceil \log k \rceil + 12$ and $s = 2t + \lceil \log k \rceil$. Pick a random $h_1 : [n] \to [\lceil k/\log k \rceil]$ from a
$t$-wise independent hash family and a random $h_2 : [n] \to [p - 1]$ from a pairwise independent
family.

2. For each $j \in [\lceil k/\log k \rceil]$ maintain $2s$ counters $X^j_1, \ldots, X^j_{2s}$ modulo $p$, initialized to 0.

3. Upon seeing stream update $(i, v)$, increment $X^{h_1(i)}_z$ by $v \cdot (h_2(i))^z$ for $z \in [2s]$.

4. At the stream's end, for each $j \in [\lceil k/\log k \rceil]$, attempt to recover the non-zero entries of
an $s$-sparse vector $f_j = ((f_j)_x)_{x \in \mathbb{F}_p^*}$ satisfying $\sum_{x \in \mathbb{F}_p^*} (f_j)_x x^z = X^j_z$ for each $z \in [2s]$ using
Theorem 1.

5. Define $\sigma : GF(p) \to \mathbb{Z}$ to be such that $\sigma(\alpha)$ equals $\alpha$ if $\alpha \le p/2$, and equals $\alpha - p$ otherwise.
Output $\sum_{j=1}^{\lceil k/\log k \rceil} \sum_{(f_j)_x \ne 0} |\sigma((f_j)_x)|$.

Figure 1: TwoLevelEstimator subroutine pseudocode
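The update rule of Step 3 and the map $\sigma$ of Step 5 can be sketched as follows. This is an illustration under stated assumptions, not the full subroutine: the class name `PowerSumCounters` is hypothetical, $h_1$ and $h_2$ are stored as fully random lookup tables rather than drawn from $t$-wise and pairwise independent families, powers are computed by repeated multiplication instead of the tables $T_1, T_2$, and the recovery of Step 4 (Theorem 1) is omitted.

```python
import random


class PowerSumCounters:
    """Counters X[j][z-1] = sum over items i in bucket j of f_i * h2(i)^z mod p."""

    def __init__(self, n, num_buckets, s, p, seed=0):
        rng = random.Random(seed)  # shared seed plays the role of shared randomness
        self.p, self.s = p, s
        # Assumption: fully random tables stand in for the limited-independence families.
        self.h1 = [rng.randrange(num_buckets) for _ in range(n + 1)]
        self.h2 = [rng.randrange(1, p) for _ in range(n + 1)]
        self.X = [[0] * (2 * s) for _ in range(num_buckets)]

    def update(self, i, v):
        """Step 3: add v * h2(i)^z to counter z of bucket h1(i), for z = 1..2s."""
        row = self.X[self.h1[i]]
        power = 1
        for z in range(2 * self.s):
            power = (power * self.h2[i]) % self.p  # now equals h2(i)^(z+1)
            row[z] = (row[z] + v * power) % self.p


def sigma(alpha, p):
    """Step 5: map a residue mod an odd prime p back to a signed integer."""
    return alpha if alpha <= p // 2 else alpha - p


n, p = 6, 10007  # 10007 is a small prime chosen for illustration
x = [3, 0, 5, 2, 1, 4]
y = [1, 4, 5, 0, 1, 2]
alice = PowerSumCounters(n, num_buckets=2, s=3, p=p, seed=7)
bob = PowerSumCounters(n, num_buckets=2, s=3, p=p, seed=7)
direct = PowerSumCounters(n, num_buckets=2, s=3, p=p, seed=7)
for i in range(1, n + 1):
    alice.update(i, x[i - 1])
    bob.update(i, y[i - 1])
    direct.update(i, x[i - 1] - y[i - 1])
combined = [[(a - b) % p for a, b in zip(ra, rb)] for ra, rb in zip(alice.X, bob.X)]
```

Because every update is linear modulo $p$, subtracting Bob's counters from Alice's gives the same state as running the structure on $x - y$ directly, so `combined` equals `direct.X`.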

Proof. Aside from storing $h_1, h_2, T_1, T_2$, the number of counters is $2s\lceil k/\log k \rceil = O(k)$, each of
size $O(\log p) = O(\log k)$ bits, totaling $O(k \log k)$ bits. The space to store $h_1$ is $O((\log k)(\log n))$,
and the space to store $h_2$ is $O(\log n)$ [7]. The tables $T_1, T_2$ each have $p - 1 = O(k \log k)$ entries, each
requiring $O(\log p) = O(\log k)$ bits. Processing a stream update requires evaluating $h_1, h_2$, taking
$O(\log k)$ time and $O(1)$ time, respectively [7].

As for update time, each stream token requires updating $2s = O(\log k)$ counters (Step 3).
Each counter update can be done in constant time with the help of table lookup since
$(h_2(i))^z = g^{z \cdot \mathrm{dlog}(h_2(i))} = T_1[(z \cdot T_2[h_2(i)]) \bmod (p - 1)]$.

We now analyze correctness. Define $I = \{i \in [n] : f_i \ne 0 \text{ at the stream's end}\}$. Note $|I| \le L_1 \le
k$. For $j \in [\lceil k/\log k \rceil]$, define the random variable $Z_j = |h_1^{-1}(j) \cap I|$. We now define two events.

Let $Q$ be the event that $Z_j \le s = 2t + \lceil \log k \rceil$ for all $j \in [\lceil k/\log k \rceil]$.

Let $Q'$ be the event that there do not exist distinct $i, i' \in I$ with both $h_1(i) = h_1(i')$ and
$h_2(i) = h_2(i')$.

We first argue that, conditioned on both $Q, Q'$ holding, the output of TwoLevelEstimator
is correct. Note $p - 1 \ge 4s^2\lceil k/\log k \rceil \ge 100k \log k$ (recall the definition of $s$ in Step 1 of Figure 1).
If $Q'$ occurs, $|h_2^{-1}(x) \cap h_1^{-1}(j) \cap I| \le 1$ for all $x \in [p - 1]$ and $j \in [\lceil k/\log k \rceil]$. One can then view
$X^j_z$ as holding $\sum_{x \in \mathbb{F}_p^*} (r_j)_x x^z$, where $(r_j)_x$ is the frequency (modulo $p$) of the unique element in the
set $h_2^{-1}(x) \cap h_1^{-1}(j) \cap I$ (or 0 if that set is empty). Conditioned on $Q$, every $r_j$ is $s$-sparse, so we
correctly recover $r_j$ in Step 4 by Theorem 1 since $2s + 1 = 10\lceil \log k \rceil + 49 < 100k\lceil \log k \rceil < p$. Note
that $p$ is strictly greater than twice the absolute value of the largest frequency since $L_1 \le k$, and
thus negative frequencies are strictly above $p/2$ in $GF(p)$, and positive frequencies are strictly below
$p/2$. Thus, given that the $r_j$ are correctly recovered, $\sigma$ correctly recovers the actual frequencies in
Step 5, implying correctness of the final output.

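Now we proceed to lower bound $\Pr[Q \wedge Q']$. First we show $Q$ occurs with probability at least
7/8. If we let $Z_{j,i}$ indicate $h_1(i) = j$, then note the random variables $\{Z_{j,i}\}_{i \in I}$ are $t$-wise independent
and $Z_j = \sum_{i \in I} Z_{j,i}$. Also, $E[Z_j] = |I| / \lceil k/\log k \rceil \le \log k$. Noting

\[ \Pr[Z_j > s] \le \Pr[|Z_j - E[Z_j]| \ge 2t] \]

then setting $A = 2t$ and applying Lemma 2,

\[ \Pr[|Z_j - E[Z_j]| \ge 2t] \le 8\left(\frac{tE[Z_j] + t^2}{(2t)^2}\right)^{t/2} \le 8\left(\frac{2t^2}{(2t)^2}\right)^{\lceil \log k \rceil + 6} \le \frac{1}{8k} \]

since $E[Z_j] \le t$ and $t/2 = \lceil \log k \rceil + 6$. A union bound over the at most $k$ values of $j$ implies $\Pr[Q] \ge 7/8$.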

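Now we analyze $\Pr[Q' \mid Q]$. Let $Y_{i,i'}$ be a random variable indicating $h_2(i) = h_2(i')$, and define
the random variable $Y = \sum Y_{i,i'}$, where the sum ranges over unordered pairs of distinct $i, i' \in I$ with
$h_1(i) = h_1(i')$. Note $Q'$ is simply the event that $Y = 0$. We have

\[
E[Y] = \sum_{j=1}^{\lceil k/\log k \rceil} E\Big[\sum_{\{i, i'\} \subseteq h_1^{-1}(j) \cap I,\ i \ne i'} \Pr[h_2(i) = h_2(i')]\Big]
\le \sum_{j=1}^{\lceil k/\log k \rceil} \frac{E[|h_1^{-1}(j) \cap I|^2/2]}{p - 1}
\le \sum_{j=1}^{\lceil k/\log k \rceil} \frac{E[|h_1^{-1}(j) \cap I|^2/2]}{4s^2\lceil k/\log k \rceil},
\]

where the expectation on the right side of the first equality is over the random choice of $h_1$, and the
probability is over the random choice of $h_2$. The first inequality holds by pairwise independence
of $h_2$. Conditioned on $Q$, $|h_1^{-1}(j) \cap I| \le s$ for all $j$ so that $E[Y \mid Q] \le 1/8$, implying $\Pr[Q' \mid Q] =
1 - \Pr[Y \ge 1 \mid Q] \ge 7/8$ by Markov's Inequality.

In total, we have $\Pr[Q \wedge Q'] = \Pr[Q] \cdot \Pr[Q' \mid Q] \ge (7/8)^2 > 3/4$, and the claim is proven. ■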
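The arithmetic behind the two 7/8 bounds can be checked numerically. The following is a small sanity check using exact rational arithmetic, not a substitute for the proof; the function names are hypothetical and the range of $k$ tested is arbitrary.

```python
from fractions import Fraction
import math


def parameters(k):
    """Return (ceil(log2 k), t, s) as set in Step 1 of Figure 1, for integer k >= 2."""
    log_k = (k - 1).bit_length()  # equals ceil(log2 k) for integers k >= 1
    t = 2 * log_k + 12
    s = 2 * t + log_k
    return log_k, t, s


def per_bucket_failure_bound(k):
    """The bound 8 * (1/2)^(t/2) obtained from Lemma 2 with A = 2t and E[Z_j] <= t."""
    _, t, _ = parameters(k)
    return 8 * Fraction(1, 2) ** (t // 2)


for k in range(2, 5000):
    buckets = math.ceil(k / math.log2(k))
    assert per_bucket_failure_bound(k) <= Fraction(1, 8 * k)  # each bucket fails w.p. <= 1/(8k)
    assert buckets <= k                                       # so the union bound gives <= 1/8

overall_success = Fraction(7, 8) ** 2
```

For example, `parameters(8)` gives $t = 18$ and $s = 39$, the per-bucket bound is exactly $1/64 = 1/(8k)$, and `overall_success` is $49/64$, which exceeds $3/4$.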

Remark 4. In Step 1 of Figure 1, we twice pick a hash function $h : [a] \to [b]$ from an $m$-wise
independent family for some integers $m$ and $a \ne b$ (namely, when picking $h_1$ and $h_2$). However,
known constructions [7] have $a = b$, with $a$ a prime power. This is easily circumvented. When we
desire an $h$ with unequal domain size $a$ and range size $b$, we can pick a prime $\ell \ge 2 \cdot \max\{a, b\}$, then
pick an $m$-wise independent hash function $h' : [\ell] \to [\ell]$ and define $h(x) \stackrel{\mathrm{def}}{=} (h'(x) \bmod b) + 1$. The
family of such $h$ is still $m$-wise independent, and by choice of $\ell$, no range value is more than twice
as likely as any other, which suffices for our application with a slight worsening of constant
factors.
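A minimal sketch of the construction in Remark 4 follows, assuming the standard random-polynomial family over $GF(\ell)$ as the $m$-wise independent $h'$. The helper names are hypothetical, and trial division is used to find $\ell$ only for brevity.

```python
import random


def is_prime(n):
    """Trial-division primality test, adequate for the small values used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def next_prime_at_least(n):
    while not is_prime(n):
        n += 1
    return n


def make_hash(a, b, m, rng):
    """Return h : {1..a} -> {1..b} built from a random degree-(m-1) polynomial over GF(ell)."""
    ell = next_prime_at_least(2 * max(a, b))
    coeffs = [rng.randrange(ell) for _ in range(m)]  # m random coefficients

    def h(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation of h'(x) modulo ell
            acc = (acc * x + c) % ell
        return (acc % b) + 1        # fold the value into the range {1..b}

    return h


h = make_hash(a=50, b=7, m=4, rng=random.Random(0))
values = [h(x) for x in range(1, 51)]
```

Every entry of `values` lies in $\{1, \ldots, 7\}$. Since $\ell \ge 2b$, reducing a uniform element of $GF(\ell)$ modulo $b$ makes each range value at most twice as likely as any other, which is the imbalance the remark tolerates.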

The following theorem analyzes the pre-processing and post-processing complexity of
TwoLevelEstimator.

Theorem 5. Ignoring the time needed to find the prime $\ell$ in Remark 4, the pre-processing time
of TwoLevelEstimator before seeing the stream is $O(k \log k)$, and the post-processing time is
$O(k \log k \log\log k \log\log\log k)$.

Proof. We first discuss the pre-processing time. It is known that the prime $p$ and generator $g$ for
$\mathbb{F}_p^*$ can be found in time $\mathrm{polylog}(C) = \mathrm{polylog}(k)$ (see the proof of Theorem 4 in [6]). Once we have
$p$, filling in $T_1, T_2$ takes $O(p) = O(k \log k)$ time, which dominates the pre-processing time. The
time to allocate the $O(k)$ counters $X^j_z$ is just $O(k)$.

The post-processing work is done in Steps 4 and 5 in Figure 1. For Step 4, there are $O(k/\log k)$
values of $j$, for each of which we run the algorithm of Theorem 1 with $s = O(\log k)$ and $p =
O(k \log k)$, thus requiring a total of $O(k \log k \log\log k \log\log\log k)$ field operations over $GF(p)$.
Since we precalculate the table $T_2$, we can do all $GF(p)$ operations in constant time, including
division. In Step 5 we need to sum the absolute values of $O(\log k)$ non-zero entries of $O(k/\log k)$
vectors $f_j$, taking time $O(k)$. ■

Remark 6. Our subroutine TwoLevelEstimator uses the fact that since $L_1 \le k$ and $f$ is an
integer vector, it must be the case that $L_0 \le k$. From here, what we develop is a $k$-set structure as
defined by Ganguly [13], which is a data structure that allows one to recover the $k$-sparse vector $f$.
In fact, any $k$-set structure operating in the turnstile model (i.e., where some $f_i$ can be negative)
would have sufficed in place of TwoLevelEstimator. We develop our particular subroutine since
previous approaches were either less space-efficient or did not work in the turnstile setting [12], [13].
We remark that at the cost of an extra $O(\log^2 k)$ factor in space, but with the benefit of only $O(1)$
post-processing time, one can replace TwoLevelEstimator with an alternative scheme. Namely,
for each $j \in [\lceil k/\log k \rceil]$, attempt to perfectly hash the $O(\log k)$ coordinates contributing to $\|f\|_1$
mapped to $j$ under $h_1$ by pairwise independently hashing into $O(\log^2 k)$ counters, succeeding with
constant probability. Each counter holds frequency sums modulo $p$. By repeating $r = O(\log k)$
times and taking the maximum sum of counter absolute values over any of the $r$ trials, we succeed
in finding the sum of frequency absolute values of items mapping to $j$ under $h_1$ with probability
$1 - 1/\mathrm{poly}(k)$. Thus by a union bound, we recover $\|f\|_1$ with probability 99/100 by summing up
over all $j$. The estimate of $\|f\|_1$ can be maintained on the fly during updates to give $O(1)$
post-processing, and updates still take only $O(\log k)$ time.

3.2 The Full Algorithm 

Our full algorithm requires, in part, a constant factor approximation to the $L_1$-difference. To obtain
this, we can use the algorithm of Feigenbaum et al. [11] with $\varepsilon$ a constant.

Theorem 7 (Feigenbaum et al. [11, Theorem 12]). There is a one-pass streaming algorithm for
$(1 \pm \varepsilon)$-approximating the $L_1$-difference using $O(\varepsilon^{-2} \log(nM))$ space with update time $O(\varepsilon^{-2} \log(nM))$,
and succeeding with probability at least 19/20. ■

Remark 8. It is stated in [11] that the update time in Theorem 7 is $O(\varepsilon^{-2}\,\mathrm{field}(\log(nM)))$, where
$\mathrm{field}(D)$ is the time to do arithmetic over $GF(2^D)$ (not including division). Section 2.2 of [11]
points out that $\mathrm{field}(D) = O(D^2)$ naively. In fact, it suffices for the purposes of their algorithm to
work over $GF(2^D)$ for the smallest $D \ge \log(nM)$ such that $D = 2 \cdot 3^{\ell}$, in which case a highly explicit
irreducible polynomial of degree $D$ over $\mathbb{F}_2[x]$ (namely $x^D + x^{D/2} + 1$ ^ Theorem 1.1.28]) can be
used to perform $GF(2^D)$ arithmetic in time $O(D)$ in the word RAM model without any additional
pre-processing space or time.

Main Algorithm L1-Diff:

1. Set $\varepsilon' = \varepsilon/8$.

2. Pick a random hash function $h : [q] \to [q]$ from a pairwise independent family so that
$h(x) = ax + b \bmod q$ for some prime $q \in [2nM, 4nM]$ and $a, b \in GF(q)$.

3. Initialize instantiations $\mathrm{TLE}_1, \ldots, \mathrm{TLE}_{\lceil \log((\varepsilon')^2 nM) \rceil}$ of TwoLevelEstimator with $k =
\lceil 4/(\varepsilon')^2 \rceil$. All instantiations share the same prime $p$, generator $g$, hash functions $h_1, h_2$, and
logarithm tables $T_1, T_2$.

4. Upon seeing stream update $(i, v)$, let $v_j$ be the output of the algorithm from Theorem 9
with inputs $a, b$ as in Step 2, $c = c_j = 2^{\lfloor \log q \rfloor - j}$, $d = d_j = 2^{\lfloor \log q \rfloor - j + 1} - 1$, $x = (i - 1)M + 1$,
$r = |v| - 1$, and $m = q$. Feed the update $(i, \mathrm{sgn}(v) \cdot v_j)$ to $\mathrm{TLE}_j$ for $j = 1, \ldots, \lceil \log((\varepsilon')^2 nM) \rceil$.
Let $R_j$ be the output of $\mathrm{TLE}_j$.

5. Run an instantiation TLE of TwoLevelEstimator in parallel with k = which
receives all updates, using the same $h_1, h_2, p, g, T_1, T_2$ of Step 3. Let its output be $R$.

6. Run the algorithm of Theorem 7 in parallel with error parameter 1/3 to obtain a value
$R' \in [L_1/2, L_1]$.

7. If $R' \le \lceil 1/(\varepsilon')^2 \rceil$, output $R$. Otherwise, output $q \cdot 2^{\lceil \log((\varepsilon')^2 R') \rceil - \lfloor \log q \rfloor} \cdot R_{\lceil \log((\varepsilon')^2 R') \rceil}$.

Figure 2: L1-Diff pseudocode



We also make use of the following algorithm due to Pavan and Tirthapura [25].

Theorem 9 (Pavan and Tirthapura [25, Theorem 2]). Let $a, b, c, d, x, r, m$ be integers fitting in
a machine word with $m > 0$ and $a, b, c, d \in \{0, \ldots, m - 1\}$. There is an algorithm to calculate
$|\{i : (a \cdot (x + i) + b \bmod m) \in [c, d],\ 0 \le i \le r\}|$ in time $O(\log(\min(a, r)))$ using $O(\log(r \cdot m))$
space. ■
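To make the quantity in Theorem 9 concrete, the following sketch computes it by brute force and applies it to the per-level ranges $[c_j, d_j]$ of Figure 2. This naive loop takes time linear in $r$; it is a reference for what is being counted, not the logarithmic-time algorithm of Pavan and Tirthapura. The function names are hypothetical, and `num_levels` is assumed to be at most $\lfloor \log q \rfloor$.

```python
def level_range(q, j):
    """Return (c_j, d_j) = (2^(floor(log q) - j), 2^(floor(log q) - j + 1) - 1)."""
    top = q.bit_length() - 1  # floor(log2 q) for q >= 1
    return 1 << (top - j), (1 << (top - j + 1)) - 1


def count_in_range_naive(a, b, c, d, x, r, m):
    """|{i : (a*(x+i) + b mod m) in [c, d], 0 <= i <= r}| by direct enumeration."""
    return sum(1 for i in range(r + 1) if c <= (a * (x + i) + b) % m <= d)


def per_level_counts(i, v, M, a, b, q, num_levels):
    """For update (i, v), count how many of its |v| universe items land at each level."""
    x = (i - 1) * M + 1
    r = abs(v) - 1  # r = -1 (no items) when v == 0
    counts = []
    for j in range(1, num_levels + 1):
        c, d = level_range(q, j)
        counts.append(count_in_range_naive(a, b, c, d, x, r, q))
    return counts


q = 101  # a small prime chosen for illustration
counts = per_level_counts(i=3, v=-7, M=10, a=17, b=5, q=q, num_levels=4)
```

The ranges for different levels are disjoint, so the entries of `counts` sum to at most $|v| = 7$. Level $j$ covers $2^{\lfloor \log q \rfloor - j}$ of the $q$ hash values, which is the sampling probability used in the analysis.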

Our main algorithm, which we call L1-Diff, is described in Figure 2. Both in Figure 2 and
in the proof of Theorem 10, sgn denotes the function which takes as input a real number $x$ and
outputs $-1$ if $x$ is negative, and 1 otherwise.

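Theorem 10. The algorithm L1-Diff has update time $O(\log(\varepsilon^2 nM) \log(M/\varepsilon))$ and the space used
is $O(\varepsilon^{-2} \log(1/\varepsilon) \log(\varepsilon^{-2} nM))$. Pre-processing requires $\mathrm{polylog}(nM) + O(\varepsilon^{-2} \log(1/\varepsilon) \log(\varepsilon^{-2} nM))$
time. Time $O(\varepsilon^{-2} \log(1/\varepsilon) \log\log(1/\varepsilon) \log\log\log(1/\varepsilon))$ is needed for post-processing. The output is
$(1 \pm \varepsilon)L_1$ with probability at least 2/3.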

Proof. The hash function $h$ requires $O(\log(nM))$ space. There are $O(\log(\varepsilon^2 nM))$
instantiations of TwoLevelEstimator (Steps 3 and 5), each with $k = O(\varepsilon^{-2})$, taking a total of
$O(\varepsilon^{-2} \log(1/\varepsilon) \log(\varepsilon^2 nM))$ space by Theorem 3. The hash functions $h_1, h_2$ and tables $T_1, T_2$ take
$O(\log(1/\varepsilon) \log(n) + \varepsilon^{-2} \log^2(1/\varepsilon)) = O(\varepsilon^{-2} \log(1/\varepsilon) \log n)$ space, also by Theorem 3 (recall we
assume $\varepsilon \ge 1/\sqrt{n}$). Step 6 requires only $O(\log(nM))$ space by Theorem 7, since the algorithm is run
with error parameter 1/3.

As for running time, in Step 4 we call the algorithm of Theorem 9 $O(\log(\varepsilon^2 nM))$ times, each time
with $a < q$ and $r \le M$, thus taking a total of $O(\log(\varepsilon^2 nM) \log(\min(q, M))) = O(\log(\varepsilon^2 nM) \log M)$
time. We must also feed the necessary update to each $\mathrm{TLE}_j$, each time taking $O(\log(1/\varepsilon))$ time by
Theorem 3. Updating every $\mathrm{TLE}_j$ thus takes time $O(\log(\varepsilon^2 nM) \log(1/\varepsilon))$.

In pre-processing we need to pick a prime $q$ in the desired range, which can be accomplished
by picking numbers at random and testing primality; the expected time is $\mathrm{polylog}(nM)$. We
also need to prepare $h_1, h_2, T_1, T_2$ and all the TwoLevelEstimator instantiations, which takes
$O(\varepsilon^{-2} \log(1/\varepsilon) \log(\varepsilon^2 nM))$ time by Theorem 5, in addition to the $\mathrm{polylog}(n)$ time required to find
an appropriate prime $\ell$ as described in Remark 4. The pre-processing time for Step 6 is $O(1)$ (see
Figure 1 of [11]).

In post-processing we need to recover the estimate R' from Step 6, which takes 0(1) time, then 
recover an estimate from some TwoLevelEstimator instantiation, so the time is as claimed. 
In post-processing, to save time one should not run Steps 4 and 5 of TwoLevelEstimator in 
Figure 1 except at the instantiation whose output is used in Step 7.

Now we analyze correctness. Let $Q$ be the event that $R' \in [L_1/2, L_1]$. We proceed by a case
analysis.

For the first case, suppose $L_1 \le \lceil 1/(\varepsilon')^2 \rceil$. Then, TLE computes $L_1$ exactly with probability
at least 3/4 by Theorem 3, and hence overall we output $L_1$ exactly with probability at least
$(19/20) \cdot (3/4) > 2/3$.

Now, suppose $L_1 \ge \lceil 1/(\varepsilon')^2 \rceil$. In analyzing this case, it helps to view L1-Diff as actually
computing $L_0(f') \stackrel{\mathrm{def}}{=} |\{i : f'_i \ne 0\}|$, where we consider an $nM$-dimensional vector $f'$ that is being
updated as follows: when receiving an update $(i, v)$ in the stream, we conceptually view this update
as being $|v|$ updates $((i-1)M+1, \mathrm{sgn}(v)), \ldots, ((i-1)M+|v|, \mathrm{sgn}(v))$ to the vector $f'$. Here, the
vector $f'$ is initialized to $\vec{0}$. Note that at the stream's end, $L_0(f') = \|f\|_1$.
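
The conceptual expansion can be checked on a small instance. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, indices are 1-based, and the inputs are taken to be non-negative integer vectors with entries at most M, with Alice contributing updates (i, x_i) and Bob contributing (i, -y_i), so that f = x - y.

```python
from collections import defaultdict


def expand_update(i: int, v: int, M: int):
    """Yield the |v| unit updates that conceptually replace the update (i, v)."""
    sign = 1 if v > 0 else -1
    for t in range(1, abs(v) + 1):
        yield (i - 1) * M + t, sign


def l1_difference_stream(x, y):
    """Assumed stream layout: Alice sends (i, x_i), Bob sends (i, -y_i)."""
    stream = [(i, xi) for i, xi in enumerate(x, start=1)]
    stream += [(i, -yi) for i, yi in enumerate(y, start=1)]
    return stream


def l0_of_expanded(stream, M: int) -> int:
    """Apply the expanded updates to f' and count its nonzero coordinates."""
    f_prime = defaultdict(int)
    for i, v in stream:
        for pos, sign in expand_update(i, v, M):
            f_prime[pos] += sign
    return sum(1 for c in f_prime.values() if c != 0)
```

For example, with x = [3, 0, 5], y = [1, 4, 5] and M = 5, the blocks of f' have 2, 4 and 0 nonzero coordinates respectively, matching |3 - 1| + |0 - 4| + |5 - 5| = 6.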

Let $f'^{j}$ denote the vector whose $i$th entry, $i \in [nM]$, is $f'_i$ if $h(i) \in [c_j, d_j]$ and 0 otherwise.
That is, $f'^{j}$ receives stream updates only from items fed to $\mathrm{TLE}_j$. For $i \in [nM]$, let $X_{i,j}$ be a
random variable indicating $h(i) \in [c_j, d_j]$, and let $X_j = \sum_{i : f'_i \ne 0} X_{i,j}$ so that $X_j = L_0(f'^{j})$. Define
$p_j \stackrel{\mathrm{def}}{=} (d_j - c_j + 1)/q = 2^{\lfloor \log q \rfloor - j}/q$ so that $\mathbf{E}[X_{i,j}] = p_j$. Thus, $\mathbf{E}[X_j] = p_j \cdot L_0(f')$. Note that
$1/2 < 2^{\lfloor \log q \rfloor}/q \le 1$. Conditioned on $Q$, we have the inequalities

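$$\frac{L_0(f')}{2^{\lceil \log((\varepsilon')^2 R') \rceil}} \le \frac{L_0(f')}{(\varepsilon')^2 R'} \le \frac{2}{(\varepsilon')^2}$$

and

$$\frac{L_0(f')}{2^{\lceil \log((\varepsilon')^2 R') \rceil}} \ge \frac{L_0(f')}{2(\varepsilon')^2 R'} \ge \frac{1}{2(\varepsilon')^2}.$$

By the choice of $j = \lceil \log((\varepsilon')^2 R') \rceil$ in Step 7 of Figure 2, we thus have, assuming $Q$ occurs,

$$\frac{1}{4(\varepsilon')^2} \le \mathbf{E}[X_j] \le \frac{2}{(\varepsilon')^2}$$

since $\mathbf{E}[X_j] = p_j \cdot L_0(f') = (2^{\lfloor \log q \rfloor}/q) \cdot (L_0(f')/2^j)$.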

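Let $Q'$ be the event that $|X_j - \mathbf{E}[X_j]| \le \varepsilon\,\mathbf{E}[X_j]$. Applying Chebyshev's inequality,

$$\Pr[Q' \mid Q] \ge 1 - \frac{\mathrm{Var}[X_j]}{\varepsilon^2\,\mathbf{E}^2[X_j]} \ge 1 - \frac{1}{\varepsilon^2\,\mathbf{E}[X_j]} \ge \frac{15}{16}.$$

The second inequality holds since $h$ is pairwise independent and $X_j$ is the sum of Bernoulli random
variables, implying $\mathrm{Var}[X_j] = \sum_i \mathrm{Var}[X_{i,j}] \le \sum_i \mathbf{E}[X_{i,j}] = \mathbf{E}[X_j]$. The last inequality holds by
choice of $\varepsilon' = \varepsilon/8$.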
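
The constant 15/16 can be verified directly. The short derivation below assumes the lower bound $\mathbf{E}[X_j] \ge 1/(4(\varepsilon')^2)$, which is what the case analysis gives when conditioning on $Q$, together with $\varepsilon' = \varepsilon/8$:

```latex
\[
\frac{1}{\varepsilon^2\,\mathbf{E}[X_j]}
  \;\le\; \frac{4(\varepsilon')^2}{\varepsilon^2}
  \;=\; \frac{4}{\varepsilon^2}\cdot\frac{\varepsilon^2}{64}
  \;=\; \frac{1}{16},
\qquad\text{hence}\qquad
1 - \frac{1}{\varepsilon^2\,\mathbf{E}[X_j]} \;\ge\; \frac{15}{16}.
\]
```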

Let $Q''$ be the event that $\mathrm{TLE}_j$ outputs $X_j$ correctly. Now, conditioned on $Q \wedge Q'$, we have
$X_j \le 2(1+\varepsilon)/(\varepsilon')^2 \le 4/(\varepsilon')^2$ since $\varepsilon \le 1$. Thus by Theorem 3, $\Pr[Q'' \mid Q \wedge Q'] \ge 3/4$. Overall, we
compute $L_1$ of the entire stream correctly with probability at least

$$\Pr[Q \wedge Q' \wedge Q''] = \Pr[Q] \cdot \Pr[Q' \mid Q] \cdot \Pr[Q'' \mid Q \wedge Q'] \ge (19/20) \cdot (15/16) \cdot (3/4) > 2/3.$$

■ 

Our streaming algorithm also gives a sketching procedure. This is because, as long as Alice and
Bob share randomness, they can generate the same $h, h_1, h_2, p, q$, then separately apply the streaming
algorithm to their vectors $x, y$. The sketch is then just the state of the streaming algorithm's data
structures. Since each stream token causes only linear updates to counters, a third party can then
take the counters from Bob's sketch and subtract them from Alice's, then do post-processing to
recover the estimation of the $L_1$-difference. The running time for Alice and Bob to produce their
sketches is the streaming algorithm's pre-processing time, plus $n$ times the update time. The time
for the third party to obtain an approximation to $\|x - y\|_1$ is the time required to combine the
sketches, plus the post-processing time. We thus have the following theorem.
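
The role of linearity can be seen in a toy example. The sketch below is not the paper's data structure: it is a generic bucketed-counter sketch, and the names `make_hash`, `sketch`, `combine`, the prime `Q`, and the seed-based sharing of randomness are all assumptions chosen for illustration. It shows only that, when both parties derive the same hash function from shared randomness, subtracting Bob's counters from Alice's yields exactly the counters of the difference vector.

```python
import random

Q = 2_147_483_647  # the prime 2**31 - 1, fixed only for this illustration


def make_hash(seed: int, k: int):
    """Derive a hash into k buckets from a shared seed (same seed, same hash)."""
    rng = random.Random(seed)
    a = rng.randrange(1, Q)
    b = rng.randrange(0, Q)
    return lambda i: ((a * i + b) % Q) % k


def sketch(vec, seed: int, k: int):
    """Each coordinate adds its value to one counter, a linear update."""
    h = make_hash(seed, k)
    counters = [0] * k
    for i, v in enumerate(vec):
        counters[h(i)] += v
    return counters


def combine(alice_sketch, bob_sketch):
    """Third party: subtract Bob's counters from Alice's, entry by entry."""
    return [a - b for a, b in zip(alice_sketch, bob_sketch)]
```

Because every update is linear, `combine(sketch(x, s, k), sketch(y, s, k))` equals `sketch(x - y, s, k)` for any shared seed `s`, so the third party can run post-processing as if it had streamed the difference vector itself.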

Theorem 11. Sharing $\mathrm{polylog}(nM)$ randomness, two parties Alice and Bob, holding vectors $x, y \in \{-M, \ldots, M\}^n$,
respectively, can produce $O(\varepsilon^{-2}\log(1/\varepsilon)\log(\varepsilon^2 nM))$-bit sketches $s(x), s(y)$ such
that a third party can recover $\|x - y\|_1$ to within $(1 \pm \varepsilon)$ with probability at least 2/3 given only
$s(x), s(y)$. Each of Alice and Bob uses time $O(n\log(\varepsilon^2 nM)\log(M/\varepsilon))$ to produce their sketches. In
$O(\varepsilon^{-2}(\log(\varepsilon^2 nM) + \log(1/\varepsilon)\log\log(1/\varepsilon)\log\log\log(1/\varepsilon)))$ time, the third party can recover $\|x - y\|_1$
to within a multiplicative factor of $(1 \pm \varepsilon)$. ■

Note Alice and Bob's running time is always $O(n\log^2(nM))$ since $\varepsilon \ge 1/\sqrt{n}$.
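
This bound follows from two elementary estimates, spelled out here under the standing assumption $1/\sqrt{n} \le \varepsilon \le 1$ (the upper bound on $\varepsilon$ is the one used in the correctness proof above):

```latex
\[
\log(\varepsilon^2 nM) \le \log(nM),
\qquad
\log(M/\varepsilon) \le \log(M\sqrt{n}) \le \log(nM),
\]
\[
\text{so}\qquad
n\,\log(\varepsilon^2 nM)\,\log(M/\varepsilon) \;\le\; n\,\log^2(nM).
\]
```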

Remark 12. Though we assume Alice and Bob share randomness, to actually implement our
algorithm in practice this randomness must be communicated at some point. We note that while
the sketch length guaranteed by Theorem 11 is $O(\varepsilon^{-2}\log(1/\varepsilon)\log(\varepsilon^2 nM))$ bits, the required amount
of shared randomness is $\mathrm{polylog}(nM)$, which for large enough $\varepsilon$ is larger than the sketch length.
This is easily fixed though. Since the required randomness is only polynomially larger than the
space used by the sketching algorithm (which is asymptotically equal to the sketch length), the two
parties can use the Nisan-Zuckerman pseudorandom generator [24] to stretch a seed whose length
is linear in the sketch length to a pseudorandom string of length $\mathrm{polylog}(nM)$ which still provides
the guarantees of Theorem 11. Alice and Bob then only need to communicate this random seed.